Search results for "Single-linkage clustering"

showing 10 items of 17 documents

Quantum clustering in non-spherical data distributions: Finding a suitable number of clusters

2017

Quantum Clustering (QC) provides an alternative approach to clustering algorithms, several of which are based on geometric relationships between data points. Instead, QC makes use of quantum mechanics concepts to find structures (clusters) in data sets by finding the minima of a quantum potential. The starting point of QC is a Parzen estimator with a fixed length scale, which significantly affects the final cluster allocation. This dependence on an adjustable parameter is common to other methods. We propose a framework to find suitable values of the length parameter σ by optimising twin measures of cluster separation and consistency for a given cluster number. This is an extension of the Se…

0301 basic medicine; Clustering high-dimensional data; Mathematical optimization; Cognitive Neuroscience; Single-linkage clustering; Correlation clustering; 02 engineering and technology; Computer Science Applications; Hierarchical clustering; Determining the number of clusters in a data set; 03 medical and health sciences; 030104 developmental biology; Artificial Intelligence; 0202 electrical engineering electronic engineering information engineering; Cluster (physics); 020201 artificial intelligence & image processing; QA; Cluster analysis; Algorithm; k-medians clustering; Mathematics
Neurocomputing
researchProduct

Fast dendrogram-based OTU clustering using sequence embedding

2014

Biodiversity assessment is an important step in a metagenomic processing pipeline. The biodiversity of a microbial metagenome is often estimated by grouping its 16S rRNA reads into operational taxonomic units or OTUs. These metagenomic datasets are typically large and hence require effective yet accurate computational methods for processing. In this paper, we introduce a new hierarchical clustering method called CRiSPy-Embed which aims to produce high-quality clustering results at a low computational cost. We tackle two computational issues of the current OTU hierarchical clustering approach: (1) the compute-intensive sequence alignment operation for building the distance matrix and (2) the …

Brown clustering; CURE data clustering algorithm; Single-linkage clustering; Correlation clustering; Canopy clustering algorithm; Data mining; Biology; Hierarchical clustering of networks; Cluster analysis; computer.software_genre; computer; Hierarchical clustering
Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
researchProduct

Structural clustering of millions of molecular graphs

2014

We propose an algorithm for clustering very large molecular graph databases according to scaffolds (i.e., large structural overlaps) that are common between cluster members. Our approach first partitions the original dataset into several smaller datasets using a greedy clustering approach named APreClus based on dynamic seed clustering. APreClus is an online and instance incremental clustering algorithm delaying the final cluster assignment of an instance until one of the so-called pending clusters the instance belongs to has reached significant size and is converted to a fixed cluster. Once a cluster is fixed, APreClus recalculates the cluster centers, which are used as representatives for…

Clustering high-dimensional data; Fuzzy clustering; Theoretical computer science; k-medoids; Computer science; Single-linkage clustering; Correlation clustering; Constrained clustering; computer.software_genre; Complete-linkage clustering; Graph; Hierarchical clustering; ComputingMethodologies_PATTERNRECOGNITION; Data stream clustering; CURE data clustering algorithm; Canopy clustering algorithm; FLAME clustering; Affinity propagation; Data mining; Cluster analysis; computer; k-medians clustering; Clustering coefficient
Proceedings of the 29th Annual ACM Symposium on Applied Computing
researchProduct

Incrementally Assessing Cluster Tendencies with a Maximum Variance Cluster Algorithm

2003

A straightforward and efficient way to discover clustering tendencies in data using a recently proposed Maximum Variance Clustering algorithm is proposed. The approach shares the benefits of the plain clustering algorithm with regard to other approaches for clustering. Experiments using both synthetic and real data have been performed in order to evaluate the differences between the proposed methodology and the plain use of the Maximum Variance algorithm. According to the results obtained, the proposal constitutes an efficient and accurate alternative.

Clustering high-dimensional data; k-medoids; Computer science; CURE data clustering algorithm; Single-linkage clustering; Canopy clustering algorithm; Variance (accounting); Data mining; Cluster analysis; computer.software_genre; computer; k-medians clustering
researchProduct

A Greedy Algorithm for Hierarchical Complete Linkage Clustering

2014

We are interested in the greedy method to compute a hierarchical complete linkage clustering. There are two known methods for this problem, one having a running time of \(\mathcal{O}(n^3)\) with a space requirement of \(\mathcal{O}(n)\) and one having a running time of \(\mathcal{O}(n^2 \log n)\) with a space requirement of \(\Theta(n^2)\), where \(n\) is the number of points to be clustered. Neither method can handle large point sets. In this paper, we give an algorithm with a space requirement of \(\mathcal{O}(n)\) which is able to cluster one million points in a day on current commodity hardware.
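For orientation, the greedy complete-linkage procedure this abstract refers to can be sketched naively as follows. This is only a textbook-style illustration, not the algorithm proposed in the paper: the function name `complete_linkage`, the stopping parameter `k`, and the sample points are all assumptions made for the example. It recomputes every linkage from scratch, so it keeps only linear extra space but takes roughly cubic time.

```python
from itertools import combinations
from math import dist


def complete_linkage(points, k):
    """Greedy agglomerative clustering with complete linkage.

    Naive illustration only: repeatedly merges the two clusters whose
    farthest pair of points is closest, until k clusters remain.
    """
    # Start with every point in its own cluster (clusters hold indices).
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Complete linkage: distance between the farthest pair of members.
        return max(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        # Greedy step: pick the pair of clusters with the smallest linkage.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        # a < b always holds, so deleting b does not shift index a.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical sample data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = complete_linkage(points, 3)
print(result)
```

Swapping `max` for `min` inside `linkage` would turn the same loop into single-linkage clustering, the topic of this search.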

Combinatorics; CURE data clustering algorithm; SUBCLU; Nearest-neighbor chain algorithm; Correlation clustering; Single-linkage clustering; Hierarchical clustering of networks; Greedy algorithm; Complete-linkage clustering; Mathematics
researchProduct

Efficient and Accurate OTU Clustering with GPU-Based Sequence Alignment and Dynamic Dendrogram Cutting.

2015

De novo clustering is a popular technique to perform taxonomic profiling of a microbial community by grouping 16S rRNA amplicon reads into operational taxonomic units (OTUs). In this work, we introduce a new dendrogram-based OTU clustering pipeline called CRiSPy. The key idea used in CRiSPy to improve clustering accuracy is the application of an anomaly detection technique to obtain a dynamic distance cutoff instead of using the de facto value of 97 percent sequence similarity as in most existing OTU clustering pipelines. This technique works by detecting an abrupt change in the merging heights of a dendrogram. To produce the output dendrograms, CRiSPy employs the OTU hierarchical clusterin…
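The idea of cutting a dendrogram where its merging heights change abruptly can be illustrated with a very simple stand-in: place the cut in the middle of the largest gap between consecutive merge heights. This is not CRiSPy's anomaly detection technique, whose details are not given in the excerpt; the function name `dynamic_cutoff` and the sample heights are assumptions made for the sketch.

```python
def dynamic_cutoff(merge_heights):
    """Return a cut height placed inside the largest jump between merges."""
    heights = sorted(merge_heights)
    if len(heights) < 2:
        raise ValueError("need at least two merge heights")
    # Gap between each consecutive pair of merge heights.
    gaps = [hi - lo for lo, hi in zip(heights, heights[1:])]
    # Index of the largest gap (first one wins on ties).
    i = max(range(len(gaps)), key=gaps.__getitem__)
    # Cut halfway across that gap.
    return (heights[i] + heights[i + 1]) / 2


# Hypothetical merge heights: four low merges, then an abrupt jump.
heights = [0.01, 0.02, 0.02, 0.03, 0.25, 0.30]
cutoff = dynamic_cutoff(heights)
# Each merge above the cut separates one more cluster.
n_clusters = 1 + sum(h > cutoff for h in heights)
print(cutoff, n_clusters)
```

With a fixed cutoff the threshold would be the same for every dataset; here it moves with the shape of the dendrogram.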

Computer science; Correlation clustering; Single-linkage clustering; Molecular Sequence Data; Machine learning; computer.software_genre; Pattern Recognition Automated; CURE data clustering algorithm; RNA Ribosomal 16S; Genetics; Computer Graphics; Cluster analysis; Base Sequence; business.industry; Applied Mathematics; Dendrogram; High-Throughput Nucleotide Sequencing; Pattern recognition; Signal Processing Computer-Assisted; Equipment Design; Hierarchical clustering; Equipment Failure Analysis; RNA Bacterial; Canopy clustering algorithm; Artificial intelligence; Hierarchical clustering of networks; business; computer; Sequence Alignment; Algorithms; Biotechnology
IEEE/ACM transactions on computational biology and bioinformatics
researchProduct

Clustering categorical data: A stability analysis framework

2011

Clustering to identify inherent structure is an important first step in data exploration. The k-means algorithm is a popular choice, but k-means is not generally appropriate for categorical data. A specific extension of k-means for categorical data is the k-modes algorithm. Both of these partition clustering methods are sensitive to the initialization of prototypes, which creates the difficulty of selecting the best solution for a given problem. In addition, selecting the number of clusters can be an issue. Further, the k-modes method is especially prone to instability when presented with ‘noisy’ data, since the calculation of the mode lacks the smoothing effect inherent in the calculation …
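To make the k-modes ingredients concrete, the sketch below shows the two pieces that replace the mean and the Euclidean distance of k-means: a per-attribute mode as the cluster prototype and a simple mismatch count as the dissimilarity. It is a minimal illustration, not the stability framework of the paper; the helper names `hamming` and `cluster_mode` and the sample rows are assumptions.

```python
from collections import Counter


def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_mode(rows):
    """Prototype of a cluster: the most common value in each column."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


# Hypothetical categorical records: (colour, size).
rows = [("red", "small"), ("red", "large"), ("blue", "small")]
mode = cluster_mode(rows)
distances = [hamming(r, mode) for r in rows]
print(mode, distances)
```

Because the mode is a vote rather than an average, changing a single record can flip a prototype value outright, which is one way to see why noisy data can destabilise the result.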

Computer science; business.industry; Single-linkage clustering; Correlation clustering; Constrained clustering; computer.software_genre; Machine learning; Determining the number of clusters in a data set; Data stream clustering; CURE data clustering algorithm; Consensus clustering; Data mining; Artificial intelligence; Cluster analysis; business; computer
2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)
researchProduct

Distance-constrained data clustering by combined k-means algorithms and opinion dynamics filters

2014

Data clustering algorithms represent mechanisms for partitioning huge arrays of multidimensional data into groups with small in-group and large out-group distances. Most of the existing algorithms fail when a lower bound for the distance among cluster centroids is specified, while this type of constraint can be of help in obtaining a better clustering. Traditional approaches require that the desired number of clusters is specified a priori, which requires either a subjective decision or global meta-information knowledge that is not easily obtainable. In this paper, an extension of the standard data clustering problem is addressed, including additional constraints on the cluster centroid di…

Fuzzy clustering; Correlation clustering; Single-linkage clustering; Constrained clustering; computer.software_genre; Determining the number of clusters in a data set; Settore ING-INF/04 - Automatica; Data clustering; k-means; Opinion dynamics; Hegselmann–Krause model; CURE data clustering algorithm; Data mining; Cluster analysis; Algorithm; computer; k-medians clustering; Mathematics
22nd Mediterranean Conference on Control and Automation
researchProduct

Scalable Clustering by Iterative Partitioning and Point Attractor Representation

2016

Clustering very large datasets while preserving cluster quality remains a challenging data-mining task to date. In this paper, we propose an effective scalable clustering algorithm for large datasets that builds upon the concept of synchronization. Inherited from the powerful concept of synchronization, the proposed algorithm, CIPA (Clustering by Iterative Partitioning and Point Attractor Representations), is capable of handling very large datasets by iteratively partitioning them into thousands of subsets and clustering each subset separately. Using dynamic clustering by synchronization, each subset is then represented by a set of point attractors and outliers. Finally, CIPA identifies the…

Fuzzy clustering; General Computer Science; Computer science; Single-linkage clustering; Correlation clustering; Constrained clustering; 02 engineering and technology; computer.software_genre; ComputingMethodologies_PATTERNRECOGNITION; Data stream clustering; CURE data clustering algorithm; 020204 information systems; 0202 electrical engineering electronic engineering information engineering; Canopy clustering algorithm; 020201 artificial intelligence & image processing; Data mining; Cluster analysis; computer
ACM Transactions on Knowledge Discovery from Data
researchProduct

Paradigm of tunable clustering using Binarization of Consensus Partition Matrices (Bi-CoPaM) for gene discovery

2013

Copyright © 2013 Abu-Jamous et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Clustering analysis has a growing role in the study of co-expressed genes for gene discovery. Conventional binary and fuzzy clustering do not embrace the biological reality that some genes may be irrelevant for a problem and not be assigned to a cluster, while other genes may participate in several biological functions and should simultaneously belong to multiple clusters. Also, these algorithms cannot generate tight cluster…

Fuzzy clustering; Microarrays; Single-linkage clustering; Genes Fungal; Gene Expression; lcsh:Medicine; Biology; Fuzzy logic; Set (abstract data type); Molecular Genetics; Engineering; Genome Analysis Tools; Yeasts; Consensus clustering; Molecular Cell Biology; Databases Genetic; Cluster (physics); Genetics; Cluster Analysis; Binarization of Consensus Partition Matrices (Bi-CoPaM); Cluster analysis; lcsh:Science; Gene clustering; Biology; Oligonucleotide Array Sequence Analysis; Genetics; Multidisciplinary; business.industry; Cell Cycle; ta111; lcsh:R; Computational Biology; Pattern recognition; Genomics; gene discovery; Partition (database); tunable binarization techniques; ComputingMethodologies_PATTERNRECOGNITION; Genes; Cell cycles; Signal Processing; lcsh:Q; Artificial intelligence; business; Genomic Signal Processing; Algorithms; Research Article; clustering
researchProduct